Before you can analyse data, you often need to create new variables or summaries, or rename and reorder the existing ones.
library(nycflights13)
library(tidyverse)
?flights
nycflights13::flights
View(flights)
filter() allows you to subset observations based on their values.
* The first argument is the name of the data frame.
* The second and subsequent arguments are the expressions that filter the data frame.
Example:
filter(flights, month == 1, day == 1)
Running this line of code makes dplyr execute the filtering operation and return a new data frame.
* dplyr functions never modify their inputs, so if you want to save the result you’ll need to use the assignment operator, <-:
jan1 <- filter(flights, month == 1, day == 1)
R either prints the result or saves it to a variable. To do both, wrap the assignment in parentheses:
(dec25 <- filter(flights, month == 12, day == 25))
To use filtering effectively, you need to know how to select the observations you want using the comparison operators (>, >=, <, <=, != and ==).
* Don’t rely on == when comparing floating-point numbers, because computers store them as approximations; use near() instead.
near(sqrt(2) ^ 2, 2)
[1] TRUE
near(1 / 49 * 49, 1)
[1] TRUE
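To see why near() is needed, the sketch below compares the same two expressions with == first. The variable names are only illustrative, and the block assumes dplyr is installed.

```r
library(dplyr)

# Floating-point arithmetic is approximate, so exact equality can fail.
exact_sqrt <- sqrt(2)^2 == 2        # FALSE
exact_frac <- 1 / 49 * 49 == 1      # FALSE

# near() compares within a small tolerance instead.
near_sqrt <- near(sqrt(2)^2, 2)     # TRUE
near_frac <- near(1 / 49 * 49, 1)   # TRUE

c(exact_sqrt, exact_frac, near_sqrt, near_frac)
```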
Use Boolean operators to combine several conditions: & is “and”, | is “or”, and ! is “not”.
filter(flights, month == 11 | month == 12)
nov_dec <- filter(flights, month %in% c(11, 12))
The %in% shorthand above, x %in% y, selects every row where x is one of the values in y. Sometimes you can simplify complicated subsetting by remembering De Morgan’s law: !(x & y) is the same as !x | !y, and !(x | y) is the same as !x & !y.
* If you wanted to find flights that weren’t delayed (on arrival or departure) by more than two hours, you could use either of the following two filters:
filter(flights, !(arr_delay > 120 | dep_delay > 120))
filter(flights, arr_delay <= 120, dep_delay <= 120)
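As a quick check that the two forms agree, here is a minimal sketch on a made-up four-row tibble. The name `delays` and its values are hypothetical stand-ins for flights, and the block assumes dplyr is installed.

```r
library(dplyr)

# Toy stand-in for flights (hypothetical values, no missing data).
delays <- tibble(
  arr_delay = c(10, 150, 30, 200),
  dep_delay = c(5, 20, 130, 180)
)

# !(x | y) form
negated_or <- filter(delays, !(arr_delay > 120 | dep_delay > 120))

# !x & !y form (comma-separated conditions are combined with "and")
both_small <- filter(delays, arr_delay <= 120, dep_delay <= 120)

# Both keep only the first row, where neither delay exceeds 120 minutes.
all.equal(negated_or, both_small)
```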
Use is.na() to determine if a value is missing.
is.na(nov_dec)
year month day dep_time
[1,] FALSE FALSE FALSE FALSE
[2,] FALSE FALSE FALSE FALSE
[3,] FALSE FALSE FALSE FALSE
[ output truncated: a logical matrix of 55403 rows by 19 columns, one column per variable in nov_dec ]
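filter() only includes rows where the condition is TRUE; it excludes both FALSE and NA values. If you want missing values to be included, ask for them explicitly.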
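A minimal sketch of that behaviour on a small made-up tibble follows. The name `df_na` is hypothetical, and the block assumes dplyr is installed.

```r
library(dplyr)

df_na <- tibble(x = c(1, NA, 3))

# Rows where the condition evaluates to NA are dropped along with the FALSE rows.
only_true <- filter(df_na, x > 1)            # keeps the row with x = 3

# Ask for missing values explicitly to keep them.
with_na <- filter(df_na, is.na(x) | x > 1)   # keeps the rows with x = NA and x = 3
```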
Flights that arrived more than two hours late:
filter(flights, arr_delay > 120)
Flights that flew to Houston (IAH or HOU):
filter(flights, dest == "IAH")
filter(flights, dest == "HOU")
arrange() works similarly to filter(), except that instead of selecting rows it changes their order.
arrange(flights, year, month, day)
Use desc() to reorder by a column in descending order:
arrange(flights, desc(dep_delay))
Missing values are sorted at the end:
df <- tibble(x = c(5, 2, NA))
arrange(df, x)
arrange(df, desc(x))
df <- tibble(x = c(5, 2, NA),
             y = c(2, NA, 2))
rowSums(df)
[1] 7 NA NA
arrange(df, desc(is.na(x)))
arrange(df, -(is.na(x)))
This code sorts on whether x is missing: is.na(x) is TRUE for the NA values, so sorting it in descending order puts the rows with missing values first.
arrange(flights, dep_delay)
arrange(flights, desc(dep_delay))
arrange(flights, air_time)
# Shortest flights (by air time)
flights %>%
  arrange(air_time) %>%
  select(carrier, flight, air_time)
# Fastest flights (highest average speed, in miles per minute)
flights %>%
  arrange(desc(distance / air_time)) %>%
  select(carrier, flight, distance, air_time)
select() narrows the data down to the variables you are actually interested in. It allows you to rapidly zoom in on a useful subset using operations based on variable names.
Helper functions you can use within select():
* starts_with("abc"): matches names that start with “abc”
* ends_with("xyz"): matches names that end with “xyz”
* contains("ijk"): matches names that contain “ijk”
* matches("(.)\\1"): matches any variables whose names contain repeated characters
* num_range("x", 1:3): matches x1, x2, and x3
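The helpers above can be tried on a small tibble. In this sketch the tibble `toy` and its column names are hypothetical, chosen only so that each helper matches something; the block assumes dplyr is installed.

```r
library(dplyr)

# Hypothetical one-row tibble; the column names exist only to exercise each helper.
toy <- tibble(abc_1 = 1, size_xyz = 2, hijkl = 3, address = 4,
              x1 = 5, x2 = 6, x3 = 7)

starts   <- names(select(toy, starts_with("abc")))     # "abc_1"
ends     <- names(select(toy, ends_with("xyz")))       # "size_xyz"
has_ijk  <- names(select(toy, contains("ijk")))        # "hijkl"
repeated <- names(select(toy, matches("(.)\\1")))      # "address" ("dd" and "ss")
ranged   <- names(select(toy, num_range("x", 1:3)))    # "x1" "x2" "x3"
```

The backslash in the regular expression is doubled because it sits inside an R string.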
Use rename() to change variable names instead of select(), because select() drops all the variables you don’t explicitly mention:
rename(flights, tail_num = tailnum)
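To compare the two functions side by side, here is a small sketch on a made-up three-column tibble. The name `toy` and its single row are illustrative only, and the block assumes dplyr is installed.

```r
library(dplyr)

# Illustrative one-row tibble with three of the flights column names.
toy <- tibble(tailnum = "N14228", origin = "EWR", dest = "IAH")

renamed  <- rename(toy, tail_num = tailnum)   # keeps all three columns
selected <- select(toy, tail_num = tailnum)   # keeps only the renamed column

names(renamed)    # "tail_num" "origin" "dest"
names(selected)   # "tail_num"
```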